On Lower Bounds for Regret in Reinforcement Learning
Authors
Ian Osband, Benjamin Van Roy
Abstract
We consider the problem of learning to optimize an unknown MDP M∗ = (S, A, R∗, P∗). S = {1, …, S} is the state space and A = {1, …, A} is the action space. In each timestep t = 1, 2, … the agent observes a state sₜ ∈ S, selects an action aₜ ∈ A, receives a reward rₜ ∼ R∗(sₜ, aₜ) ∈ [0, 1], and transitions to a new state sₜ₊₁ ∼ P∗(sₜ, aₜ). We define all random variables with respect to a probability space (Ω, F, P).
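To make the interaction protocol concrete, here is a minimal sketch of this agent-environment loop for a tabular MDP. Everything in it is an assumption for illustration: the arrays R_star and P_star stand in for the unknown R∗ and P∗, rewards are drawn as Bernoulli with the given means, and the uniform-random policy is a placeholder for an actual learning algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unknown MDP M* with S states and A actions (the learner never sees these arrays).
S, A = 5, 3
R_star = rng.uniform(size=(S, A))                # mean rewards R*(s, a) in [0, 1]
P_star = rng.dirichlet(np.ones(S), size=(S, A))  # P_star[s, a] is a distribution over next states

s = 0  # initial state
total_reward = 0.0
for t in range(1, 1001):
    a = int(rng.integers(A))                # placeholder policy: choose a_t uniformly at random
    r = float(rng.random() < R_star[s, a])  # reward r_t ~ Bernoulli(R*(s, a)), so r_t in [0, 1]
    total_reward += r
    s = int(rng.choice(S, p=P_star[s, a]))  # transition s_{t+1} ~ P*(s, a)
```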
Similar articles
Near-optimal Regret Bounds for Reinforcement Learning
This technical report is an extended version of [1]. For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ i...
Near-optimal Regret Bounds for Reinforcement Learning
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinfo...
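Spelled out as an equation (a standard formulation consistent with this description, writing T(s′ | M, π, s) for the random number of steps needed to reach s′ from s under policy π in MDP M):

```latex
D(M) := \max_{s \neq s'} \; \min_{\pi} \; \mathbb{E}\left[ T(s' \mid M, \pi, s) \right]
```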
Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic on...
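For intuition on the upper-confidence-bound technique this snippet refers to, here is a minimal UCB1-style sketch for the multi-armed bandit setting it mentions. This illustrates the bandit method the authors adapt, not the UCRL algorithm itself; the pull function, arm means, and bonus constant are assumptions for the example.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Play each arm once, then always pick the arm maximizing mean + exploration bonus."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: try every arm once
        else:
            # Optimism in the face of uncertainty: empirical mean plus a bonus
            # that shrinks as an arm is pulled more often.
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                                    + math.sqrt(2.0 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r

# Example: three Bernoulli arms with hypothetical means.
means = [0.3, 0.5, 0.7]
ucb1(lambda a: float(random.random() < means[a]), n_arms=3, horizon=10000)
```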
Improved Regret Bounds for Undiscounted Continuous Reinforcement Learning
We consider the problem of undiscounted reinforcement learning in continuous state space. Regret bounds in this setting usually hold under various assumptions on the structure of the reward and transition function. Under the assumption that the rewards and transition probabilities are Lipschitz, for a 1-dimensional state space a regret bound of Õ(T^{3/4}) after any T steps has been given by Ortner...
Adaptive aggregation for reinforcement learning in average reward Markov decision processes
We present an algorithm which aggregates online when learning to behave optimally in an average reward Markov decision process. The algorithm is based on the reinforcement learning algorithm UCRL and uses confidence intervals for aggregating the state space. We derive bounds on the regret our algorithm suffers with respect to an optimal policy. These bounds are only slightly worse than the orig...
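As a rough illustration of the idea of aggregating states via confidence intervals (the specifics below are assumptions for the sketch, not details taken from the paper): states whose empirical estimates carry overlapping confidence intervals are statistically indistinguishable given the data so far, and can be merged into one aggregate state.

```python
import math

def aggregate(estimates, counts, t, delta=0.05):
    """Group states whose confidence intervals for an estimated quantity overlap.

    estimates[s] is an empirical mean for state s and counts[s] its visit count;
    the interval half-width is a Hoeffding-style bonus. Greedy single pass over
    states sorted by estimate; purely illustrative.
    """
    width = lambda s: math.sqrt(math.log(2 * t / delta) / max(1, 2 * counts[s]))
    order = sorted(range(len(estimates)), key=lambda s: estimates[s])
    groups, current = [], [order[0]]
    for s in order[1:]:
        prev = current[-1]
        # Merge s into the current group if its interval overlaps the previous one's.
        if estimates[s] - width(s) <= estimates[prev] + width(prev):
            current.append(s)
        else:
            groups.append(current)
            current = [s]
    groups.append(current)
    return groups

# Hypothetical estimates: states 0/1 and 2/3 look alike, so three aggregate groups emerge.
print(aggregate([0.1, 0.12, 0.5, 0.52, 0.9],
                [4000, 3500, 5000, 4500, 6000], t=23000))
```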
Journal: CoRR
Volume: abs/1608.02732
Pages: -
Publication date: 2016